“R-omics”: Using R for Bioinformatics

R-Ladies MTL October 2023 Meetup

2023-10-30

R-Ladies MTL October 2023 Meetup

Welcome! Today’s agenda:

6:00 - 6:15: Settling in

6:15 - 6:30: Introductions and call for presenters!

6:30 - 7:30: “R-omics”: Using R for bioinformatics

- Includes 20+ min break for exercises 

7:30 - 8:00: R-help/networking period

R-Ladies

R Ladies Global seeks to achieve proportionate representation by encouraging, inspiring, and empowering women and gender minorities currently underrepresented in the R community

We focus on the R language/environment (but if you want to talk about python or SAS, you won’t be banned)

All members must follow the RLadies Code of Conduct

  • Community driven, independent, free of charge
  • Safe and friendly, zero harassment!

Introductions: Kathryn Morrison

Introductions: Monica Cella

Introductions: Rhiannon Kamstra

What is bioinformatics?

Data sources

  • Often complex, high-throughput technology
  • “-omics” examine large pools of biological molecules

Tools for biological data

  • Bioinformatics is broad and rapidly evolving

  • Selection of tools and/or languages depends on application

  • Applications with a Graphical User Interface are common (e.g., Clustal for alignment)

  • As are command-line tools

R in bioinformatics

  • R is open-source and vastly extendable via packages
  • Data wrangling, exploration, analysis, and visualization all in once place
  • Supports reproducible science and knowledge sharing
    • Workflows can be scaled and shared by writing scripts or packages

Bioconductor

  • Started in 2001
  • 2nd largest R package repository after the CRAN
    • ~2.3k packages in v3.18!
  • Focused on tools for analyzing data from biological assays

Installing Bioconductor packages

  1. Install Bioconductor’s package manager

    if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
  2. Search for available packages

    # List all packages
    BiocManager::available()
    # List only packages containing "sapiens" in the name
    BiocManager::available(pattern = "sapiens")

    You can also use https://www.bioconductor.org/packages/.

  1. Install Bioconductor packages

    # Install a specific package, e.g. Biostrings
    BiocManager::install("Biostrings")
    # Install all core Bioconductor packages (may take a while!)
    BiocManager::install()

    Note: avoid using install.packages() to ensure you get the most up-to-date version

Working with raw sequence data

  • DNA, RNA, and amino acid (protein) sequences can be represented as strings
# Letters represent bases G, C, T, A
dna <- "TAGCAG"
  • Specialized object types can add functionality and improve performance (e.g., DNAString, AAString)
# DNAString objects are used for DNA sequences
Biostrings::DNAString(dna)
6-letter DNAString object
seq: TAGCAG

Use objects designed to work with multiple sequences

  • Multiple sequences can be stored in a single vector
# DNAStringSet objects contain multiple DNAString values
dna3 <- c("TAGCAG", "CCATTG", "AAG")
Biostrings::DNAStringSet(dna3)
DNAStringSet object of length 3:
    width seq
[1]     6 TAGCAG
[2]     6 CCATTG
[3]     3 AAG

Base R functions are often still useful

  • Check the number of values/sequences
more_dna <- Biostrings::DNAStringSet(c("GC", "TA"))
length(more_dna)
  • Add/extract elements (sequences)
# Extract the 2nd sequence
more_dna[2]
# Add a 3rd sequence
more_dna[3] <- "AGGTAG"
Example functions for working with sequences
Biostrings Function Purpose
translate() Translate DNA/RNA to amino acids

complement()

reverseComplement()

Get the (reversed) complementary sequence of base pairs
alphabetFrequency() Count the frequency of each letter
matchPattern() Finds matching occurrences of a pattern

Exercises

  1. Install the Bioconductor package Biostrings .

  2. Store at least 2 DNA sequences in a DNAStringSet object.

  3. Translate your DNA into amino acids.

    Bonus 1: Try incorporating ambiguous bases in your DNA. Test different if.fuzzy.codon options of translate().

  4. Count the number of methionine (M) residues in each sequence.

    Bonus 2: Summarize the distribution across all of your sequences.

https://www.ddbj.nig.ac.jp/ddbj/code-e.html

Working with special file formats

  • Bioinformatics data often come in specialized file formats

  • Common sequence data formats include:

    • FASTA

    • FASTQ - contains quality information

    • BAM

    • SAM - contains alignment to reference sequence

  • samtools, seqinr, ShortReads are helpful packages for working with files

Command-line tools

  • Many useful tools are used from the command-line

    blastp -query aa_seq.fasta -remote -db nr -out results.txt -outfmt 6 -evalue 1e-30

First 10 lines of output:

# BLASTP 2.9.0+ 
# Query: sars_partial
# RID: KZSA0UTY016
# Database: nr  
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 500 hits found
sars_partial    7WHC_A  100.000 70      0       0       1       70      1       70      3.24e-44        151
sars_partial    7UJU_A  100.000 70      0       0       1       70      2       71      3.52e-44        151
sars_partial    7UJ9_A  100.000 70      0       0       1       70      1       70      4.77e-44        151

Build and execute calls from R

  • The same call from R
aa_seq <- "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA"
names(aa_seq) <- "sars_partial"
query_file <- "aa_seq.fasta"
out_file <- "results.txt"

# Write sequence as a FASTA file
seqinr::write.fasta(aa_seq, names(aa_seq), file.out = my_file)

# Construct call
blast_call <- paste0("blastp -query ", query_file, " -remote -db nr -out ", out_file, " -outfmt 7 -evalue 1e-30")

# Execute
system(blast_call)

Note: Command-line tools (e.g., blastp) need to be installed and configured before use.

High-dimensional data

  • E.g., gene expression data from microarrays or RNAseq

    • Quantity of expression (e.g., # of transcripts) x 1,000s of genes (features)
  • Presents special challenges for analysis and visualization

    • E.g., batch effects, multiple testing, handling large datasets
  • Spatial biology platforms map gene expression over tissue coordinates

Demo: Mapping gene expression in a mouse brain

10X Visium counts mRNA for each gene in each 50 micron spot on the grid

Demo: Mapping gene expression in a mouse brain